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Abstract 

In this contribution, wc propose a generic online (also sometimes called adaptive or recursive) 
version of the Expectation-Maximisation (EM) algorithm applicable to laten t variable models 
of independent observations. Compared to the algorithm of IXitterington 1 1984 ). this approach is 
more directly connected to the usual EM algorithm and does not rely on integration with respect to 
the complete data distribution. The resulting algorithm is usually simpler and is shown to achieve 
convergence to the stationary points of the KuUback-Leibler divergence between the marginal 
distribution of the observation and the model distribution at the optimal rate, i.e., that of the 
maximum likelihood estimator. In addition, the proposed approach is also suitable for conditional 
(or regression) models, as illustrated in the case of the mixture of linear regressions model. 

Keywords: Latent data models, Expectation-Maximisation, adaptive algorithms, online estima- 
tion, stochastic approximation, Polyak-Ruppert averaging, mixture of regressions. 



1 Introduction 



The EM (Expectation-Maximisation) algorithm (jPempster et al.l . 119771 ) is a popular tool for maximum- 
likelihood (or maximum a posteriori) estimation. The common strand to problems where this ap- 
proach is applicable is a notion of incomplete data, which includes the conventional sense of missing 
data but is much broader than that. The EM algorithm demonstrates its strength in situations 
where some hypothetical experiments yields complete data that are related to the parameters more 
conveniently than the measurements are. Problems where the EM algorithrn has proven to be useful 



i nclude, among many others, mixture of densities (jTitterington et al.l . Il985l ). censored data models 



(|Tanneil . E^ ). etc. The EM algorithm has several appealing properties. Because it relies on com- 
plete data computations, it is generally simple to implement: at each iteration, (i) the so-called E-step 
only involves taking expectation over the conditional distribution of the latent data given the obser- 
vations and (ii) the M-step is analogous to complete data weighted maximum-likelihood estimation. 
Moreover, (in) the EM algorithm naturally is an ascent algorithm, in the sense that it increases the 
(observed) likelihood at each iteration. Finally under some mild additional conditions, (iv) the EM 
algorithm may be sh own to con verge to a stationary point (i.e., a point where the gradient vanishes) 
of the log-hkelihood (IWul . ll983l ). Note that convergence to the maximum likelihood estimator cannot 
in general be guaranteed due to possible presence of multiple stationary points. 

When processing large data sets or data streams however, the EM algorithm becomes impractical 
due to the requirement that the whole data be available at each iteration of the algorithm. For this 
reason, there has been a strong interest for online variants of the EM which make it possible to estimate 
the parameters of a latent data model without storing the data. In this work, we consider online 
algorithms for latent data models with independent observations. The dominan t approach (see also 
Section [2.21 below) to online EM-like estimation follows the method proposed by Titterington ( 19841 ) 
which consists in using a stochastic approximation algorithm, where the parameters are updated after 
each new observation using the gradient of the incomplete data likelihood weighted by the complete 
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data Fisher informatio n matrix. This approach has been used, wit h some variations, in many different 
ap pUcations (see, e . g.. IChung and Bohmel . l2005l : iLiu et al.l . I2OO6I ): a proof of convergence was given 



bv lWang and Zhaol (|2006h 7 

In this contribution, we propose a new onhne EM algorithm that sticks more closely to the 
principles of the original (batch-mode) EM algorithm. In particular, each iteration of the proposed 
algorithm is decomposed into two steps, where the first one is a stochastic approximation version of 
the E-step aimed at incorporating the information brought by the newly available observation, and, 
the second step consists in the maximisation program that appears in the M-step of the traditional EM 
algorithm. In addition, the proposed algorithm does not rely on the complete data information matrix, 
which has two important consequences: firstly, from a practical point of view, the evaluation and 
inversion of the information matrix is no longer required, secondly, the convergence of the procedure 
does not rely on the implicit assumption that the model is well- specified, that is, that the data under 
consideration is actually generated by the model, for some unknown value of the parameter. As a 
consequence, and in contrast to previous work, we provide an analysis of the proposed algorithm 
also for the case where the observations are not assumed to follow the fitted statistical model. This 
consideration is particularly relevant in the case of conditional missing data models, a simple case of 
which is used as an illustration of the proposed online EM algorithm. Finally, it is shown that, with 
the additional use of Polyak-Ruppert averaging, the proposed approach converges to the stationary 
points of the limiting normalised log- likelihood criterion (i.e., the Kullback-Leibler divergence between 
the marginal density of the observations and the model pdf) at a rate which is optimal. 

The paper is organised as follows: In Section [2l we review the basics of the EM and associated 
algorithms and introduce the proposed approach. The connections with other existing methods are 
discussed at the end of Section 12.31 and a simple example of application is described in Section 12.41 
Convergence results are stated in Section [3l first in term of consistency (Section l3.ip and then of con- 
vergence rate (Section 13. 2p . with the corresponding proofs given in Appendix |Al Finally in Section HI 
the performance of this approach is illustrated in the context of mixture of linear regressions. 



2 Algorithm Derivation 

2.1 EM Basics 



In thi s section, we review the key properties of the EM algorithm as introduced by lDempster et al 



(jl977l ). The latent variable statistical model postulates the existence of a non-observable or latent 
variable X distributed under f{x; 9) where {/(x; 9),6 ^ Q} denotes a parametric family of probability 
density functions indexed by a parameter value G C M'^''. The observation Y is then viewed as 
a deterministic function of X which takes its values in the set y. This latent variable mechanism 
provides a unified f ramework for situation s which includes missing data, censored observations, noisily 



observed data, . . . (jPempster et al.l . 119771 ) . 

We will denote by g[y] 0) the (observed) likelihood function induced by the latent data model. 
In addition, the notations 'Ke['] and E£)[-|y] will be used to denote, respectively, the expectation and 
conditional expectation under the model parameterised by 9. Likewise, let vr denote the probability 
density function of the observation Y , where we stress again that we do not restrict ourselves to the 
case where 7r(-) = g{-;9*), for an unknown value 9* of the parameter. The notations and E^r will 
be used to denote probability and expectation under the actual distribution of the observation. 

Given n independent and identically distributed observations Yi-n {Yi, . . . , Yn), the maximum 

likelihood estimator is defined as 9n *== argmaxgge logg{Yi-n', 9), where 

n 

log5(yi:n;^) ='J]log<7(^.;0). (1) 

1=1 
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Note that we normahse the log-hkeUhood (by n) to ease the transition to the onhne setting where n 
increases when new observations become available. 

The EM algorithm is an iterative optimisation algorithm that maximises the above (normalised) 
log-likelihood function despite the possibly complicated form of g resulting from the latent data model. 
Traditionally, each EM iteration is decomposed in two steps. The E-step consists in evaluating the 
conditional expectation 



i=l 



(2) 



where 9k is the current estimate of 9, after k iterations of the algorithm. In the M-step, the value 
of 9 maximising Qn (Yi:n',9) is found. This yields the new parameter estimate 9k+i- This two step 
process is repeated until convergence. The essence of the EM algorith r n is th at increasing Qg {Yi-^n] 9) 



1973). 



forces an increase of the log-likelihood log ^(li:^; ^) (jPempster et al.l . 

For ;U : G — > M a differentiable function, denote by Ve// = {diJ,/d9i, . . . , dii/d9dg)^ the gradient of 



fi. If fi is twice differentiable, we denote by V^/x the Hessian matrix which is a 
components are given by 



matrix whose 



; 1 < < f^e- Following Lange ( 19951 ). if the M-step of the 



EM algorithm is replaced by a Newton update, one obtains, assuming some regularity, the following 
recursion 



h+l — &k + Ik+l 



J{Yl:n', 9k 



j=i 



Velog fix f,9k)\Yi 



(3) 



where jk+i is a step size (7^+1 = 1 correspond to the actual Newton update) and J(Yi:„;0) = 
Yl'i=i Ji^i] 9) with J( y-9) = -Eo [Vjl o R f(X ;9)\Y = y] . Note that due to the so-called Fisher's 



identity (see discussion of Dempster et al. . 1973), the gradient term indeed coincides with the (ob- 
served data) score function as 



Ke [Velogf{X;e)\Y] =VeloggiY; 



(4) 



The algorithm in ^ can be shown to be locally equivalent to the EM algorithm at convergence 
(lLangei . [l995l ). ;n practise, the step-size 7^+1 is often adjusted using line searches to ensure that the 
likelihood is indeed increased at each iteration. In addition, J{Yi-n'i9) is not necessarily a positive 
definite matrix or could be badly conditioned; therefore, some adjustment of the weight matrix 
J{Yi-n;9) may be necessary to avoid numerical problems. 

A well-known modification of the Newton recursion consists in replacing J{Yi-n;9) in (jS]) by the 
Fisher Information Matrix (FIM) associated to a complete observation. 



m^^'-Ee [Vilog/(X;0)] 



(5) 



Under the mild assumption that the complete data model is regular, I{9) is guaranteed to be positive 
definite. This modified recursion, which is mor e robust, may be seen as an approximation of the 
scoring method ( McLachlan and Krishnan , 19971 ). where the complete data FIM is used in place of 
the actual (observed) FIM 

IoU0) = -Ke[VlloggiY;9)] , (6) 



despite the fact that, in general, /obs(^) and I{6) are different. 1(9) usually also differs from J(Yi:„; ( 
as J(Yi-,n',9) converges, as n tends to infinity, to 



U{9) -E, [Ee [Vllog f{X;9)\Y] 



(7) 



which doesn't correspond to a Fisher information matrix in the complete data model, except when vr 
coincides with f{-;9). In the particular case, however, where the complete data model belongs to a 
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canonical (or naturally parameterised) exponential family of distributions, J{Yi-n]()) coincides with 
I{6) and thus does not depend on vr anymore. Hence, except in some specific cases or if one assumes 
that the model is well-specified (i.e., vr = g{-;9*)), the convergence behaviour of the recursion in ([3]) 
will be different when 1(9) is used instead of J{Yi-n',0). 



2.2 Stochastic Gradient EM Algorithms 

Being able to perform online estimation means that the data must be run through only once, which 
is obviously not possible with the vanilla EM algorithm. To overcome this difficulty we consider in 
the sequel online algorithms which produce, at a fixed computational cost, an updated parameter 
estimate 9n for each new observation Yn- Note that in the online setting, the iteration index (which 
was previously denoted by k) is identical to the observation index n and we will use the latter when 
describing the algorithms. To our best knowledge , the first online parameter estimation procedure 
for latent data models is due to lTitteringtonl (|l984l ) who proposed to use a stochastic approximation 



version of the modified gradient recursion: 

6n+i = 9n + 7n+i/~^(4)Ve logg{Yn+i; On) , (8) 

where {7n} is a decreasing sequence of positive step sizes. One may also consider using a stochastic 
approximation version of the original (Newton) recursion in ([3]) : 

Bn+i = 9n + -fn+iI-H9n)Ve logg{Yn+i; 0n) ■ (9) 

Note that ([9]) does not correspond to a practical algorithm as In{9n) is usually unknown, although 
it can be estimated, for instance, by recursively averaging over the values of J{Yn',9n)- As discussed 
above however, this algorithm may be less robust than ^ because J{Yn;9n) is (usually) not guar- 
anteed to be positive definite. In the following, we will refer to ([8]) as Titterington's o nline algorithm 
and to dl]) as the online gradient algorithm (in reference to the title of the paper by Langel . 1995 ). 



Note that both of these algorithms are based on the stochastic gradient approach and bear very little 
resemblance with the original EM algorithm. 



2.3 The Proposed OnHne EM Algorithm 

We now consider an online approach which is more directly related to the principle underpinning the 
EM algorithm. The basic idea is to replace the expectation step by a stochastic approximation step, 
while keeping the maximisation step unchanged. More precisely, at iteration n, consider the function 

Qn+l{0) = Qn{0)+Jn+1 (% [log /(X„+i ; 0) | F^+i] - Qn{9)) , (10) 

and set 9n+i as the maximum of the function 9 i— > Qn+i(9) over the feasible set 0. One important 
advantage of (jlOh compared to ([8]) is that it automatically satisfies the parameter constraints without 
requiring any further modification. In addition, (jlOp does not explicitly require the inversion of a 
{dg X dg) matrix. For further comparisons between both approaches, both practical and in terms of 
rate of convergence, we refer to the example of Section 12.41 and to the analysis of Section [3l 

Of course, this algorithm is of practical interest only if it is possible to compute and maximise 
Qn{G) efficiently. In the following we focus on the case where the complete data likelihood belongs 
to an exponential family satisfying the following assumptions. Let (•, •) denotes the scalar product 
between two vectors of and | • | the associated norm. 

Assumption 1. (a) The complete data likelihood is of the form 

fix; 0) = h{x) exp {-^ + (5(x), m)} ■ (H) 



4 



(b) The function 

siy-e)''^'Ee[SiX)\Y = y] , (12) 
is well defined for all {y,0) £ y x Q. 

(c) There exists a convex open subset S C M.'^, which is such that 

• for all s € S, {y,9) €y X e and-f € [0, 1), (1 - -f)s + js{y; 9) G S. 

• for any s G 5, the function 9 i-^ £{s; 9) =^ — V'(^) + (s, </'(^)) has a unique global maximum 
over Q denoted 9 {s),i.e. 

9 (s) *== argmaxggQ£(s; 9) . (13) 

Assumption [T] implies that the evaluation of Eg [log /(X; 0)|y] , and hence the E-step of the 
EM algorithm, reduces to the computation of the expected value Eg [5(X)|y] of the complete data 
sufficient statistic S{X). Indeed, the EM reestimation functional Q0i(Yi-n',9) is then defined by 



Qe'{Yi-.n,9)=^^-^Yj{Yi-e')-9j . 
The (A; + l)-th iteration of the (batch mode) EM algorithm may thus be expressed as 

4+1 = ^(n-^f;s(yi;4)) , 



(14) 



where the M-step corresponds to the application of the function 6. Note that the construction of the 
set S in Assumption [T]|^c) reflects the fact that in most applications of EM, the M-step is unambiguous 
only when a sufficient number of observations have been gathered. This point will be illustrated in 
the example to be considered in Section H] below. Assumption [T]^c) takes care of this issue in the case 
of the online EM algorithm. As an additional comment about Assumption [H note that we do not 
require that (p he & one to one mapping and hence the complete data model may also correspond 
to a cur ved exponential farnily, w h ere typically 9 is o f much lower dimension than ip{9) (see, for 



instance, Ichun. and B5hmj M): ICar^r^e et al.l » for an example involving Gaussian densities 



with structured covariance matrices). 

In this setting, the proposed online EM algorithm takes the following form 

Sn+l = Sn+ 'yn+l{s{Yn+l]9n) - Sn) , 
9n+l = 9 (Sn+l) ■ 



(15) 



Algorithm of that kind have a rather long history in the machine learning community. Th e idea of 
sequen tially updating the vector of sufficient statistics has apparentl y been first proposed bvlNowlan 
(1991)) using a fixed step size (or learning rate) 7n = 7 (see also Jordan and Jacobi . 1994 ). The 
online EM algorithm (I15D i s also closely related to the "incremental" version of the EM algorithm 
derived by Neal and Hinton ( 19991 ). The incremental setting is more general than the recursive setting 
considered here, because the observations are not necess arily processed s equentially in time and might 
be used several times. The incremental EM algorithm of lNeal and Hintoru (jl999 ) defines the (A;+l)-th 
parameter estimate as 



min(fc+l,n,) 



h+1 = 9 [mm{k + l,n)Y 



E 

i=l 



Sk+1,' 



(16) 
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where Sk+i,i = Sk,i if i / h+i and 5^+1,7^+1 = The index 4+1 is typically chosen as 

k -\- 1 while k < n — 1 and runs through the data set, that is, € {1, . . . , n}, in a fixed or pseudo 
random scanning order for subsequent iterations. When used in batch mode (that is when A; > n) it 
is seen that it mostly differs from the traditional EM strategy in (jl4p by the fact that the parameters 
are updated after each computation of the conditional expectation of the complete data sufficient 
statistic corresponding to one observation. When used in online mode (k < n), the algorithm of 



Neal and HintonI (jl999l ) coincides with the proposed online EM algorithm with a step-size of 7^ = 1/k 



(see Section [3] for further discussion of this particular choice of step s izes). A specific instance of the 
proposed online EM algorithm has been derived by Sato and Ishii ( 200d ) for maximum likeli hood 



estima tion in the so-called normalised Gaussian network; this algorithm was later extended by ISato 



( 200d ) to a canonical exponential family {4>{9) = 6 in (jlll) ) and a sketch of the proof of convergence 



based on stochastic approximation results, was given. The online EM algorithm defined in (jl5p may 
be seen as a generalisation of this scheme. 

2.4 An Example: Poisson Mixture 

Before analysing the c onvergenc e of th e above algorithm, we first consider a simple example of appli- 
cation borrowed from Liu et al. ( 20061 ): consider the case of a mixture of m Poisson distributions 



m 

giy;9) = ^u;j^e-^^ , for y = 0, 1, 2, . . . , (17) 
i=i ^' 

where the unknown parameters 9 = (wi, . . . , tOm, Ai, . . . , A^) satisfies the constraints Uj > 0, ujj = 
1 and Xj > 0. In the mixture problem, the incompleteness is caused by the ignorance of the com- 
ponent of the mixture. Let be a random variable taking value in {1, . . . ,m} with probabilities 
{uJi, ... ,ujm}- The random variable W is called the regime or state and is not observable. The 
probability density defined in (fT7|) corresponds to the assumption that Y is distributed, given that 
W = j, according to a Poisson random variable with parameter Xj. Note that in this case, as in 
all examples which involve the simpler missing data mechanism rather than the general latent data 
model introduced in Section 12.11 the complete data X simply consists of the couple {Y, W) and hence 
conditional expectations of X given Y really boils down to expectations of W given Y. 
For the Poisson mixture, the complete data log-likelihood is given by 

m m 

log f{y,w; 9) = -log(y!) + ^ [logiivj) - Xj] <5„j + ^log{Xj)y6^j , (18) 

i=i i=i 

where 6ij is the Kronecker delta symbol: 6i^i = 1 if i = I and 6i^i = otherwise. The complete data 
likelihood may be rewritten as in ([TT]) with h{y,w) = — log(y!), S{y,w) = {Si{y,w), . . . ,Smiy,w)) 
and ct>{9) = {M0),---,(l^rn{9)),wlieve 

In this case, the conditional expectation of the complete data sufficient statistics is fully determined 
by the posterior probabilities of the mixture components defined by 

Wj{y- 9) ¥e[W = j\Y = y] = ^^'^'\y ' , for j = 1, . . . , m . 
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The (n + l)-th step of the onUne EM algorithm consists in computing, for j = 1, . . . , m, 



Wj(Yn+i;9n) 
'Wj{Yn+i;On)Yn+l 
Sj,n+l(2) 



'^j,n+l — Sj^n+li'i-) , -^j.n+l — ^T^ 7TT • (19) 

Comparing with the generic update equations in (|15p . one recognises the stochastic approximation 
version of the E-step, in the first hne of (jl9p . followed by the application of 0. 

To compare with Titterington's online algorithm in ([8]), one first need to evaluate the complete 
Fisher information matrix I{0). To deal with the equality constraint YlY=i^j ~ 1' only the first 
m — 1 weights are use d as parameters and the remaining one is represented as Um = 1 — ^T=i ^ 



m 



TitteringtonI ( 19841 ) . The complete data Fisher information matrix defined in ([5]) is then given by 



m 



diag(u;i , . . . ,uj^_^) + ujj 

Im-llm-l 0(m-l)xm 

^mx (m— 1) 

diag(a;i/ Ai, . . . , w^/ Am) 



where the superscript T denotes transposition, 1 and respectively denote a vector of ones and 
a matrix of zeros, whose dimensions are specified as subscript. Upon inverting I{0), the following 
expression for the (n + l)-th step of Titterington's online algorithm is obtained : 



l^j^n+l = l^j,n + 7n+l {wj(Xn+l,On) — l^j,n^ , 



^j,n+l — Aj,n + 7ra+l "'^ T"*""^' "'^ ( Yn+1 — ^j,n ) ■ (20) 



To make the connection more explicit with the update of the online EM algorithm, note that due to 
the fact that, in this simple case, there is an identification between some components of the vector 
of sufficient statistics and the weight parameters (i.e., 6 (sj,n) = it is possible to rewrite (|19|) in 
terms of the latter only: 

"nj ^j,n I ) 

>^j,nUJj,n + 7n+l {uJj (^n+1 ; ^n)'5^n+l " 
Aj,n+1 = Z • 

In the Poisson mixture example, the two algorithms differ only in the way the intensities of the 
Poisson components are updated. Whereas the online EM algorithm in (jl9p does ensure that all 
parameter constraints are satisfied, it may happen, in contrast, that (j20p produces negatives values 
for the intensities. Near convergence however, the two algorithms behave very similarly in this simple 
case (see Proposition [6] below). 

2.5 Extensions 

As previously mentioned, Neal and Hinton ( 19991 ) advocate the use of online algorithms also in the 



case of batch training with large sample sizes. The online algorithm then operates by repeatedly scan- 
ning through the available sample. In our setting, this use of the proposed online EM algorithm may 
be analysed by letting vr denote the empirical measure associated with the fixed sample Xi, . . . ,Xn- 
The results to follow thus also apply in this context, at least when the data scanning order is random. 

In semi-parametric regression models, each observation Y comes with a vector of covariates Z 
whose distribution is usually unspecified and treated as a nuisance parameter. To handle latent data 



versions of regression models (mixture of regressions, mixture of experts, etc. — see iGriin and Leischl . 
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20071 : 1 Jordan and Jacobsl . ll994l . as well as the example of Section in our framework, one only needs 



to assume that the model consists of a parametric family {f{x\z;6),9 € G)} of conditional pdfs. In 
this setting however, it is not possible anymore to compute expectations under the complete data 
distribution and the model can never be well-specified, as the distribution of Z is left unspecified. 
Thus Titterington's algorithm in ([8]) does not directly apply in this setting. In contrast, the proposed 
algorithm straightforwardly extends to this case by considering covariate-dependent expectations of 
the sufficient statistics of f{x\z;6), of the form s{y,z;6) = E0[5'(X)|y = y,Z = z], instead of (fT2|) . 
For notational simplicity, we state our results in the following section without assuming the presence 
of covariates but extension to the case where there are covariates is straightforward; the example of 
Section H] corresponds to a case where covariates are available. 



3 Convergence Issues 
3.1 Consistency 

In this section, we establish the convergence of the proposed algorithm towards the set of stationary 
points of the Kullback-Leibler divergence between the ac tual observatioii density and the model 
likelihood. These results are the analogues of those given bv lWang and Zhaol (|2006l l for Titterington's 
online algorithm, with a somewhat broader scope since we do not assume that the model is well- 
specified. The proofs corresponding to this section are given in Appendix [XI In addition to the 
conditions listed in Assumption [H we will require the following additional regularity assumptions. 

Assumption 2. (a) The parameter space Q is a convex open subset ofW^^ and ijj and (p in (UD are 
twice continuously differentiable on Q. 

(b) The function s 6 (s), defined in (I13p . is continuously differentiable on S, 

(c) For some p > 2, and all compact subsets /C C 5, 

supE^ (|s(y;^(s))|^) < oo . 

To analyse the recursion (jlSp , the first step consists in expressing it as a standard Robbins- Monro 
stochastic approximation procedure operating on the complete data sufficient statistics: 

+ 7„+l (h(s„) + Cn+l) , (21) 

where h : 5 — > M*^^ is the so-called mean field given by 

h{s) = E^[s{Y;e{s))]-s, (22) 
and {^n}n>i is a sequence of random variables representing stochastic perturbations defined by 

^n+l = Kyn+i;Hsn))-E [s{Yn+i-, 6 {Sn))\Tn\ , (23) 

where Tn is the u-field generated by (sq, {li}^=i). The aim of the Robbins-Monro procedure (i2T]l is 
to solve the equation h(s) = 0. As a preliminary step, we first characterise the set of roots of the 
mean field h. The following proposition shows that, if s* belongs to 

r = {s€S: h(s) = 0} , (24) 



then 6* = 9 (s*) is a stationary point of the Kullback-Leibler divergence between vr and 

7r(y) 



log 



9iY; 



(25) 
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Proposition 3. Under Assumptions{I\{^ if s* G S is a root o/h, i.e., h(s*) = 0, then 6* = 6 (s*) is 
a .stationary point of the function i— > K (vr \\gg), i.e., Vg K (vr ||5'6»)le=6»* ~ 0. Conversely, if 9* is a 
stationary point o/ ^ K (vr Wge), then s* = Et^[s{Y; 9*)] is a root of h. 

We then show that the function w : 5 — > [0, oo) defined by 

9eis)) ' (26) 

is a Lyapunov function for the mean field h and the set T, i.e. for any s £ S, (Vsw(s), h(s)) < 
and (Vsw(s), h(s)) = if and only if h(s) = 0. The existence of a Lyapunov function is a standard 
argument to prove the global asymptotic stability of the solutions of the Robbins-Monro procedure. 
This property can be seen as an analog of the monotonicity property of the EM algorithm: each 
unperturbed iteration s^+i = Sfc + 7fc+ih(sfc) decreases the Kullback-Leibler divergence to the target 
distribution vr, provided that jk+i is small enough. 

Proposition 4. Under Assumptions[l\{^ 

• w(s) is continuously difjerentiable on S, 

• for any compact subset IC C S \ r, 

sup (Vsw(s), h(s)) < . 



w(s) =^ K ^vr 



Using this result, we may now prove the convergence of the sequence {sfe}. Denote by £ = 
{9 £ Q : V^iK (vr ||(76i ) = 0} the set of stationary points of the Kullback-Leibler divergence, and, for 
X e M*" and ^ C M™, let d{x, A) = inf{y £ A,\x - y\}. 

Theorem 5. Assume{I\\M and that, in addition, 

1. < 7i < 1, 7i = and X^^^i < oo, 

2. sq £ S and with probability one, limsup|s„| < oo and liminf 5"^) > 0. 

3. The set w{T) is nowhere dense. 

Then, lim^^oo '^(•Sm L) = and lim„_,oo '^(^rn >C) = 0, with probability one. 



The first condition of Theorem 



5] is standard for decreasing step-size stochastic approximation 



with a £ 



procedures (iKushner and Yinl . 119971 ). It is satisfied for instance by setting 7. 
(1/2,1]. The additional requirements that 7^ be less than 1 and sq be chosen in S is just meant 
to ensure that the whole sequence {sk} stays in S (see Assumption [T]J^c) ) . The rest of the second 
assumption of Theorem[5] correspond to a stability assumption which is not trivial. In general settings, 
the stability of the algorithm can be enforced by truncating the algorithm updates, either on a fixed 
set (s ee, e.g.. iKushiier and Yinl. l2003l. ch apter 2) or on an expanding sequence of sets (see, e.g.. IChenl . 
2OO2I . chapter 2, or lAndrieu et al.l . l2005l ^. We do not explicitly carry out these constructions here to 
keep the exposition concise. 



3.2 Rate of Convergence 



In this section, we show that when approaching convergence, the online EM algorithm is comparable 
to the online gradient algorith m in (|9l). The exist ence of such links is hardly su rprising, in view of 
the discussions in Section 4 of Titterington ( 19841 ) and Section 3 of Langg (1995), and may be seen 
as a counter part, for stoc hastic approximation, of the asymptotic equivalence of the gradient EM 



Langd (119951 ) and the EM algorithm. To highlight these relations, we first express the 



algorithm of 

online EM algorithm as a stochastic approximation procedure on 9. 
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Proposition 6. Under the assumptions of Theorenn\^ the online EM sequence {^n}n>o given by (jlSp 
follows the recursion 

+ 7„+i 4 ^(4)Ve log5(y„+i; + 7„+i/9„+i (27) 
where lim„_>oo Pn. = a.s. and It^{6) is defined in 0. 

Hence, the online EM algorithm is equivalent, when approaching convergence, to the online gradi- 
ent algorithm defined in ^ which coincides with Titterington's online algorithm with lTr{On) substi- 
tuted for I{9n)- It is remarkable that the online EM algorithm can achieve a convergence performance 
similar to that of the online gradient algorithm without explicit matrix approximation nor inversion. 
Note that, as previously discussed in Section 12.11 in the particular case of canonical exponential 
families, and I{0) coincide and the proposed online EM algorithm is thus also equivalent (near 

convergence) to Titterington's online algorithm. 

Although the recursion (|27|) will not lead to asymptotic efficiency, we can, under appropriate 

— 1/2 

additional conditions, guara ntee 7n ^ -cons istency and asymptotic normality. We use the weak con- 
vergence result presented in IPelletieil . 1998 . Theorem 1. 

Theorem 7. Under the assumptions of Theorem let 9* be a (possibly local) minimum of the 
KuHback-Leibler divergence: 9 >-^K{tt \\ge)- Denote by 

dof 



H{9n''^'l-\9*) [-VlK{7r\ 



r(r) ''^'l~\9*)E^ (Velogg{Y;9*){Velogg{Y;9*)f) I~\ 



Then, 



1. H(9*) is a stable matrix whose eigenvalues have their real part upper bounded by —X{9*), where 
X{9*) > 0. 

2. Let 7„ = 7on~°, where 70 may be chosen freely in (0, 1) when a £ (1/2, 1) but must satisfy 70 > 
X{9*) when a = 1; then, on the event 0,(9*) = {lim„^oo = (^*}, the sequence 7„, (j)n — 
converges in distribution to a zero mean Gaussian distribution with covariance S(0*), where 
$](^*) is the solution of the Lyapunov equation 

{H{9*) + CId) S(r) + S(r) {H^{9*) + CId) = -T{9*) , (28) 

where C = if a E (1/2, 1) and ( = -Jq^ if a = 1, and. Id denotes the identity matrix. 

Solving (j28p is easy for a well-specified model, that is, when vr = gg* , as the FIM /obs(^*) associated 
to the (observed) data model then satisfies 

/obs(^*) = -Ee* [Vl\oggiY-9^)] = I-\9^) 

= -Vi K {ge^ \\ge)\e=e* = (Ve log g{Y; 9*) {Ve log g{Y; 9*)}' 



When C = 0) solution of the Lyapunov equation is given by S(6'*) = I^^^^{9'')/2, the asymptotic 
covariance matrix is equal to one-half the inverse of the FIM. When ( ^ 0, the Lyapunov equation 
cannot be so l ved in explicitly, except when the parameter is scalar (the result then coincides with 



Titteringtonl . Il984l . Theorem 1). Note that using 7^ = 70?! " with a = 1 provides the optimal 



convergence rate of l/\/n but only at the price of a constraint on the scale 70, which is usually 
impossible to check in practice. On the other hand, using a € (1/2, 1) results in a slower convergence 
rate but without constraint on the scale 70 of the step-size (except for the fact that it has to be 
smaller than 1). 
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To circumvent this difficulty, we recommend to use the so-called Polya k-Ruppe r t aver aging tech- 
nique (Polvak . 1990 : Ruppert , 198S) as a post-proce s sing s tep. Following Polvak ( 199d ) — see also 
Polvak and Juditskvl (|1992l ^ : iMokkadem and Pelletieil (j2006l )— . if 7 = 70^"", with a G (1/2, 1), then 



the running average 



def 



n 



j=no 



'] 1 



n> I 



(29) 



converges at rate l/-v/n, for all values of 70. Furthermore, on the event $7(0*) defined in Theorem [7] 
above, ^/n{6n — 9*) is asymptotically normal, with asymptotic covariance matrix 



S(r) = H-\9*)T{9'')H-\e*) = 

[-VlK{7r\\ge)\s^s.]~'7r(Veloggg.{Velogg,.f) [-Vj K {ir \\ge)\s^s.]~' , (30) 

which is known to be optimal ( Kushner and Yin . 199?! ). If tt = gg*, the previous result shows that 
the averaged sequence 9n is an asymptotically efficient sequence of estimates of 9*, i.e. the asymptotic 
covariance of \/n{9n — 9*) is equal to the inverse of the (observed data) FIM /obs(0*)- 



4 Application to Mixtures of Gaussian Regressions 



To illustrate the performance of the proposed method, we consider a regression model which, as 
discussed in Section 12.51 corresponds to a case where the complete data FIM is not available. In 
contrast, we illustrate below the fact that the proposed algorithm, without explicitly requesting the 
determination of a weighting matrix does provide asymptotically efficient parameter estimates when 
Polyak-Ruppert averaging is used. 

The model we consider is a finite mixture of Gaussian linear regressions, where the complete 
data consists of the response variable i?, here assumed to be scalar for simplicity, the d^-dimensional 
vector Z that contains the explanatory variables, and, W which corresponds, as in the example of 
Section 12.41 to a latent mixture indicator taking its value in the finite set { 1 , . . . , m} . We assume that 



given W = j and Z, R is distributed as a J\f{PjZ, cr|) Gaussian variable, while W is independent 
of Z and such that P6»(W^ = j) = u>j. Thus the parameters of the model are the mixture weights ujj 
and the regression vectors (3j and variances cr|, for j = 1, . . . ,m. As is usually the case in conditional 
regression models, we specify only the part of the complete data likelihood that depends on the 
parameters, without explicitly modelling the marginal distribution of the vector of regressors Z. In 
terms of our general notations, the complete data X is the triple {R, Z, W), the observed data is the 
couple {R, Z) and the model is not well-specified, in the sense that the dis tri bution of the observation 
{R, Z) is not fully determined by the model. We refer to Hurn et aD ( 2003 ) or Griin and Leisch ( 200?! ) 
and references therein for more information on mixture of regression models and their practical use. 

In the mixture of Gaussian regressions model, the part of the complete data log-likelihood that 
depends on the parameters may be written as 



^ogf{r,w,z;9) = ^ 



log{ujj 



logaf + 



(31) 



where 6 denotes, as before, the Kronecker delta. To put ([3T|) in the form given in ([TT]) . one needs to 
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define the statistics S = {Sij, S2J, S^j, Sij)i<j<:rn where 



Sij{r,w,z) 
S2jir,w,z) 
S3j{r,w,z) 
S4j{r,w,z) 



(scalar) , 
(4 X 1) , 
{dz X 4) , 
(scalar) . 



(32) 



As in the simple Poisson mixture example of Section 12.41 the E-step statistics only depend on the 
conditional expectation of the indicator variable W through 

sij(r,z;6') = Wj{r,z;e) , 

52, j{r,z;9) = Wj{r,z;e)rz , 

53, j{r, z; 9) = Wj{r, z; 6)rzz'^ , 

S4 J (r, z;9) = % (r, z;e)r'^ , (33) 

where 



Wj{r, z; 0) = EelW = j\R = r,Z 



Finally, it is easily checked that that the M-step is equivalent to an application of the function 



exp 




lir-pjz?' 




El^i^exp 


[ 2 J 



: s 



l<j<m 



where 

Ulj{s) = 

a] is) - 



si 



(34) 



In this example, the role played by the set S in Assumption H^c) is important: In order to 
apply (fM|) . its required that the scalars sij belong to the open set (0,1) and that the {dz + 1)- 
dimensional matrices block-defined by 



Mi 



S3,j S2J 
T 

^2,j ^4,j 



be positive definite, since (^jis) is, up to normalisation by sij, the Schur complement of Mj. These 
constraints, for j = 1, . . . , m define the set S which is indeed open and convex. The function s defined 
in (j33p however never produces values of s which are in S. In particular, ssj{r, z;9) is a rank one 
matrix which is not invertible (unless dz = 1). Hence the importance of using an initialisation sq 
which is chosen in S. For the simulations below, we took care of this issue by inhibiting the parameter 
re-estimation step in (I34p for the first twenty observations of each run. In other words, the first twenty 
observations are used only to build a up a value of §20 , using the first line of (jlSp , which is in S with 
great probability. 

For illustration purpose, we consider a variation of a simple simulation example used in the 
f lexmix R package ([Leischl . |2004| ). where m = 2, uJi = uj2 = 0.5, and 



R 



5U + V (when W = 1) 

15 + 10 [/ - C/2 + F (when W = 2) ' 



where U ~ Unif(0, 10) and V ~ M{0, 9^). In order to balance the asymptotic variances of the regres- 
sion parameters (see below) we used Z-^ = {1,U, U"^ /lO) as the vector of regressors, hence the actual 
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value of the regression parameter is jSf = (0, 5, 0) for the first component and /?J = (15, 10, —10). The 
corresponding data is shown in Figure [1] where the points corresponding to both classes are plotted 
differently for illustration purpose, despite the fact that only unsupervised estimation is considered 
here. The labelling is indeed rather ambiguous in this case as the posterior probability of belonging 
to one of the two classes is between 0.25 and 0.75 for about 40% of the points. 
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Figure 1: 500 points drawn from the model: circles, points drawn from the first class, and, crosses, 
points drawn from the second class (the algorithms discussed here ignore the class labels). 

Clearly, the mixture of regressions model is such that the associated complete data sufficient 
likelihood has the form given in Assumption [Ul^a) , where the marginal density of the explanatory 
variables in Z appears in the term h{x) since it does not depend on the parameters. Hence the 
previous theory applies straightforwardly and the online EM algorithm may be used to maximise 
the conditional likelihood function of the responses R given the regressors Z. However, the explicit 
evaluation of the complete data FIM I{9) defined in ([5]) is not an option here because the model does 
not specify the marginal distribution of Z. Titterington's online algorithm may thus not be used 
directly. Applying the recursion in ([8]) without a weighting matrix is not recommended here as the 
regression parameters are greatly correlated due to the non-orthogonality of the regressors. 

In order to determine a suitable weighting matrix, one can use Fisher's relation which gives, 
for the regression parameters, 

\/f3Aogg{r\z;d)=Ee [V pAog f{R,W\Z;e)\ R = r, Z = z;e] 

(r — (3jz)z 
= Wj{r, z; 6) ^ . 

Hence, if we assume that the model is well-specified, the (observed) FIM /obs(^) may be approximated, 
near convergence, by computing empirical averages of the form 

n 

1/n V^^, log g{Ri\Zf, 9) {V^^, log g{Ri\Z.i; 9) f . 

i=l 

As the online EM algorithm does not require such computations, this estimate has been used only to 
determine the FIM at the actual parameter value for comparison purpose. It is easily checked that 



{R-(3jZ)Z 



R = r,Z 
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Figure 2: Box-and-whisker plots of the three components of (32 (from top to bottom) estimated from 
500 independent runs of length n = 100 for, EM5: five iterations of the batch EM algorithm, OLl: 
online EM algorithm with ji = OL06: online EM algorithm with 7^ = OL06a: online EM 

algorithm with 7j = and averaging started from the 50th iteration. The horizontal dashed line 

corresponds to the actual parameter value and the interval in bold at the right of each plot to the 
interquartile range deduced from the asymptotic normal approximation. 
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Figure 3: Same plots as in Figure [2] for signals of length n 
from the 5,000th iteration). 



10, 000 (OL06a uses averaging started 
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due to the linearity of the model and the fact that both components have equal weights and variance, 
the covariance matrices for /?i and P2 are the same. The numerical approximation determined from 
a million simulated observations yields asymptotic standard deviations of (47.8,22.1,21.1) for the 
coordinates of jSj, with an associated correlation matrix of 

1 -0.87 0.75 \ 
-0.87 1 -0.97 . 
0.75 -0.97 1 / 

As noted above, the coordinates of the regression vector are very correlated which would make the 
unweighted parameter-space stochastic approximation algorithm (i.e., ([8]) with an identity matrix 
instead of I~^{9n)) very inefficient. 
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Figure 4: Example of parameters trajectories for the three components of (32 (from top to bottom) 
for a signal of length n = 5, 000: OLl, dashed line, OL06 doted line, OL06a solid line (with averaging 
started after 1,000 iterations. 

For run-lengths of n = 100 and n = 10, 000 observations, we illustrate the performance of the 
following four algorithmic options : 

EMS Five iterations of the batch EM algorithm, using the whole data. 
OLl The online EM algorithm with step size 7n = 1/n. 
OL06 The online EM algorithm with step size 7„ = n""'^. 

OL06a The online EM algorithm with step size 7„ = n~^'^ and averaging started from no = n/2 
according to (129^ . 

Note that whereas OLl and OL06a have the same computational complexity (as the averaging post- 
processing has a negligible cost) , EM5 is significantly more costly requiring five times as many E-step 
computations; it is also non-recursive. All algorithm are started from the same point and run for 500 
independent simulated replicas of the data. The results (for (32) are summarised as box-and- whisker 
plots in Figure [H for n = 100, and Figure [3] for n = 10, 000. Comparing both figures, one observes 
that OL06a is the only approach which appears to be consistent with a variance compatible with 
the asymptotic interquartile range shown on the right of each plot. EM5 (five iterations of batch 
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EM) is clearly the method which has the less variability but Figure [3] suggests that it is not 
consistent, w hich was indeed c o nfirrn ed using longer runs not shown here. This observation supports 
the claim of Neal and Hinton ( 19991 ) that, for large sample sizes, online EM approaches are more 
efficient, from a computational point of view, than the batch EM algorithm which requires several 
iterations to converge properly. The online EM with step size 7„ = 1/n (OLl) presents a bias which 
becomes very significant when n increases. According to Theorem [71 this problem could be avoided 
(asymptotically) by choosing a sufficiently small value of 70. For fixed n however, lowering 70 can 
only reduce the perceived speed of convergence, which is already very slow, as illustrated by Figure HI 
In contrast, the online EM algorithm with Polyak-Ruppert averaging (OL06a) appears to be very 
efficient: averaging significantly reduces the variability of the OL06 estimate, reducing it to a level 
which is consistent with the asymptotic interquartile range, while maintaining a systematic bias which 
vanishes as n increases, as expected. 



5 Conclusion 

Compared to other alternatives, the main advantages of the proposed approach to online parameter 
estimation in latent data models are its analogy with the standard batch EM algorithm, which 
makes the online algorithm easy to implement, and its provably optimal convergence behaviour. In 
addition, the combination of a slow parameter decrease (7,1 = n""'^"*"^ being a typical choice) with 
Polyak-Ruppert averaging appears to be very robust. 

A limitation is the fact that the function 9 (s) has to be explicit, which, for instance, would not 
be the case for mixture of regression models with generalised link functions. Another extension of 
interest concerns non independent models and in particular hidden Markov models or Markov random 
fields. 
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A Proofs 

For fj. = {fii, . . . , /im)^ a difFerentiable function from Q to W^, we define by Vefi^ the x m matrix 
whose columns are the gradients, Ve/i"^ = [Vefii, . . . , Ve/im,]- Thus the symbol denotes either a 
vector or a matrix, depending on whether the function to which it is applied is scalar or vector-valued. 
With this convention, the matrix Vg/i"^ is the transpose of the usual Jacobian matrix. 

Proof of Propositionl^ Let s* be a root of h, and set 9* = 6{s*). For any s € S, the function 
6 I— > £{s; 9) has a unique stationary point at 9 (s). Therefore, 

- V9^(r ) + Vef{9*)s^ = . (35) 

Note that, from Fisher's identity, 

V e\ogg{y- 9) =W.e[V e\og f {X- 9)\Y = y] = -Vei^{0)+VBf{0)-s{v.O) • (36) 

As VgK (vr \\gQ ) = [Vg \ogg{Y] 9)], the latter relation implies 

VeK (^ \\ge ) = -Ve^{9) + Vef{9)^. [s{Y- 9)] . (37) 

Since h(s*) = 0, = E^[s(y; 6**)], and thus ^ and ([371) imply 

VeK (vr ) |,^,. = - Ve^(r ) + Ve</.^(r )s^ = , 

establishing the first assertion. Conversely, suppose that VgK (vr H^e )le=e)* = and set s* = E,r[^(i^; 9*)]. 
By dSZD, 

-Vgi,{9*)+Vef{9*)s'' = Q. 

Under Assumption [T]|^c) , the function 9 i— > —'ip{9) + {4>{9), s*) has a unique stationary point at 9 (s*), 
which is a maximum. Hence, 9* = 9 (s*) establishing the second assertion. □ 

Proof of Proposition Using ()37p and the chain rule of differentiation, 

V,w(s) = -V,^^ {s) {-Ve^lj{9 {s)) + Ve(t>^{9 {s))E^ [s{Y; 9 (s))] } . (38) 
For any s £ S, 9 (s) is the maximum of ^ ^ i{s; 9), thus, 

Ve£{s; ^)|,=,-(,) = = -Ve^{9 (s)) + Ve(p^{9 {s))s . (39) 
Plugging this relation into (j38p and using the definition (|22p . we obtain 

V,w(s) = - V,^^ (s) VefiO (s))h(s) . (40) 
By differentiating the function s ^ ^{s] 9 (s)) where ^{s, 9) =^ Vgi{s; 9), we obtain 
V,$^(s; 9 {s)) = Vs^^is; 9 (s)) + V.F (s) V,$^(s, 9 (s)) . 

Since V,$^(s;e) = V s e^{s; 9)f = [Ve^'^{9)]'^ and Ve'^'^{s,9) = Vl£{s,9), the latter relation 
may be alternatively written as 



Ve^(s;^)|,=,-(,) = [Ve^^i9is))f + ^s9^{s) V^^(s;( 



Because the function s Ve^(s; 0)1^^^^^^ is identically equal to 0, Vs[ V6)£(s; ^)|g^g(-^^]'^ = 0, and 
thus, 

V,0^(^"(s)) = - Vli{s;9)l^^^^^ [ys9^is)f . (41) 
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Plugging this relation into (j40p yields 

(V,w(s), h(s)) = h^(s) [Vef{9 {s))] ^ { V^^(s; ^) V,<^^(e" (s))h( 



(42) 

where assumption [T]||c) implies that, for any s e 5, the matrix V^£(s; is negative defi- 

nite (and hence invertible). Hence, for any s G 5, (Vsw(s), h(s)) < with equality if and only if 
Ve</>^(^(s))h(s) = 0. 

To deal with the equality case, assume that s* is such that SJ Q(j^ (Q (s*))h(s*) = 0, or equivalently 
that 

Ve<^^(^ (s*))E^ \s{Y-~Q{s'))\ = Ve0^(^(s^))s* . 

Under Assumptions [TK^a) and[Hc), 9* = 6 (s*) is the unique solution to the score equation (|39p . that 
i_s, such that Veil^iO*) = Vg^^{9*)s\ Hence, Ve^'^i9*)E^[s{Y;9*)] = Vgij{9*), which implies that 
9 (E^ [s{y', 9*)]) = 9* showing that 9* £ C and thus, from Proposition [3l that s* G S. The proof then 
follows upon noting that s (Vsw(s), h(s)) is continuous. □ 

Proof of Theorem\^ Under the stated assumptions, for any e > 0, there exists a compact fC C S and 
n, such that, F [Clkyni^n G •^}) > 1 — £• Therefore, for any r] > 0, 



P sup 

\ k>n 



< 



sup 

k>n 



k 



< e + : 



sup 

, k>n 



i>n 



> rj 



, i>n 



Note that Mn,k = '^i=nli^i^ic{si) is a L2-martingale, and that its angle-bracket is bounded by 
suPsgyc E^[|s(y; ^ (s))p] Yli=n'yi ^ Using Chebyshev's inequality associated to the Doob martin- 
gale inequality, we conclude that 



sup|M„,fc| >77 <2ij-^supE^[\siY;9is))\^]J27^, 



k>n 



seK 



2 

i ) 



finally showing that lim sup„ sup^->^ |M r j.^fc| = with probability one. The proof is concluded by 
applying Theorem 2.3 of Andrieu et al. ( 20051 ) which states that the sequence of sufficient statistics 
defined by (I2ip is then such that limsup„ d{sn, F) = with probability one; the result on the sequence 
of parameter estimates 0„ follows by continuity of 9. □ 

To prove Proposition [6l we will make use of the following stability lemma. 

Lemma 8. Letp > 1. Assume that for any compact subset IC C S, supg^j^^^jr [\s{Y;9 {s))\'''~\ < oo for 
some p > and that P^-a.s. limsn exists and belongs to S. Then, the sequence {s(y„+i; ^ (s„))}„>o 
is bounded in probability, i.e. 

lim limsupP(|s(y„+i;^(s„) |)| > M) = . 
M^oo n— ►oo 
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Proof. Let /C be a compact subset of S. We may decompose P 9 |)| > M) as follows 



F{\siYn+i;Hsn)\)\ >M) <P(s„0/C)+P(|s(y„+i;^(s„)|)| >M,s„G/C) 



< P(s„ /C) + A/-f supE^ [\S{Y,e{s))\P]¥{sn G /C) 
se/c 



which implies that 



limsupP(|s(y„+i;^(s„)|)| > M) < P( lim s„ ^ /C) + Af-^ sup [|5(y, ^ (s))|^ 



Proof of Proposition 0. A Taylor expansion with integral remainder shows that 

dn+l = 6 (^Sn+ 7n+l (^s{Yn+i; 9„ 

= 6n + 7n+i [Vs(9'^ (sn)]^ (^s(y"„+i ; - s„j + 7.„+ir„+i 
where the remainder r„_|_i is given by 



□ 



(43) 



X def 



s{Yn+i;9n) - Sr 



J [V,F (s„ + t7„+i {s(y„+i; - - V,^^ (s„)] dt . (44) 



We will first show that lim„_>oo '"n = a.s. Lemma [8] shows that {s(l^+i; 0„)}„>o is bounded 
in probability, which we denote by s(l^+i;0„) = Op(l). Under the assumption of Theorem [5l 
Sn = Op(l)) which implies that s{Yn+i]6n) — Sn = Op(l)- Choose e > and then a compact subset 
IC and a constant M large enough such that 



limsupP (s„ /C) + limsupP ( s(Yn+i] 9^) - s„ 



> M < e 



(45) 



Since the function Vs^ (•) is assumed continuous, it is uniformly continuous over every compact subset, 



def 



i.e., there exists a constant 5q such that ICso = {s £ S,d{s,lC) < 60} C 5 and 

sup \Vs0{s + h)-VsO{s)\<(^ ■ 

\h\<So,s&K 



(46) 



Since lim„^oo7n = and (s(y"„+i; 0.„) — s„) is bounded in probability, 7„+i(s(y„+i; 0„) — s„) converges 
to zero in probability, which we denote by 7„+i(s(y„+i; On) — s„) = op(l). For 5o > satisfying ([16|) . 
this implies in particular that 



lim P 7n+i 



•5(^+1; On) — Sr 



> <5n = . 



Therefore, 



limsupP (|r„+i I > Me) < limsupP 7„+i s(y„+i;^: 



> 5c 



+ limsupP (s„ /C) + limsupP ( s(y„+i; 9, 



> M] <€ 



showing that r^ = op(l). 
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We now proceed with the first order term in (j43p . Prom (j4ip we have, 



In addition, ([36]) and ([39|) show that 

V,(A^ =V,log5(y;^(s)) . 

Therefore, by combining (|47|) and (|48|) . the first order term in (|43|) may be rewritten as 



=e(s) 



(47) 



(48) 



[V,^^(s„)]^(s(y„+i;^(s„)) =I-i(^„,)Velogff(F„+i;^„) 



+ 



I-\en)\Velogg{Yr,+i;er, 



(49) 



Note also 



By definition, the complete data FIM may be rewritten as lTr{0) = Vg£ {s;9)\_^^^ [sCy-e)]' 

that On converges a.s. to 9* and that 1^(9*) is assumed to be positive definite. Hence, to ensure that 
the term between braces in ()49p may be neglected, we need to show that: 



'^eKsn;9)\ 



Vleis;9)\ 



(50) 



Since the function {s,9) ^ \7gi{s;9) is continuous, there exists 6i > such that, JCs^ = {s G 
S,d{s,JC) < 6i} C S and 

sup \Vl£{s + h;9 (s)) - Vli{s; 9 (s))| < e , 

|h|<5i,sG/C 

where the set /C is defined in ([^5]) . Under the stated assumption lim^^oo d(sn, >C) = a.s., which 
implies that 

lim h(s„) = lim {E[s{Yn+i;9 {sn))\J^n] - = , a.s. 

n—foo n—>oo 

By combining the latter two equations, we therefore obtain 
limsupPf Vli{sn;9)\., - Vll{s;9) 



\{s,e)={E[s{Y„ 



>e < 



limsupP(s„ /C) +limsupP(|h(s„)| > Sq) < e 



showing (150]) . 



□ 



Proof of Theorem We use the definition of the recursive EM sequence given by Proposition [6l 

The mean field associated to this sequence is given by h.g{9) —I~^{9)'VoK{7r\\gg). The Jaco- 
bian of this vector field at 9* is equal to H{9*). Since 9* is a (possibly local) minimum of the 
Kullback-Leibler divergence, the matrix VgK (vr ) is positive definite. Because the two ma- 
trices I~^{9*) VgK (vr ||(?0)|g^gi* and I^^^'^{9*) V^K (vr ||5'6»)|g^0* In^^'^{9*) have the same eigenvalues, 
counting multiplicities, the eigenvalues of the matrix H{9*) = I~^{9*) V^K (tt H^e ) are all real 
and strictly positive. This shows th e first assertion of the theorem; the proof of the second assertion 
follows directly from Pelletier . 1998I . Theorem 1. □ 
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